[GLUTEN-10568] [VL] Pass the table schema to the HiveTableHandle by kevinwilfong · Pull Request #10569 · apache/gluten

kevinwilfong · 2025-08-27T23:10:52Z

What changes are proposed in this pull request?

This PR changes the SubstraitToVeloxPlanConverter to pass the table schema rather than just the schema of the columns we
read from the files to the HiveTableHandle.

This is a necessary prerequisite for supporting index-based column resolution, whether that means reading files by column
position rather than by name to map between the file schema and the table schema, or reading file formats that contain no
schema information, such as Text.
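
One way to see why the full table schema matters for index-based resolution is the minimal sketch below. It is not Gluten or Velox code: `Schema`, `indexOf`, `resolveByName`, and `resolveByIndex` are hypothetical names, and a schema is reduced to an ordered list of column names.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema: an ordered list of column names.
using Schema = std::vector<std::string>;

// Returns the ordinal of `name` in `schema`, or nullopt if it is absent.
std::optional<std::size_t> indexOf(const Schema& schema, const std::string& name) {
  for (std::size_t i = 0; i < schema.size(); ++i) {
    if (schema[i] == name) {
      return i;
    }
  }
  return std::nullopt;
}

// Name-based resolution: a requested column maps to the file column that
// carries the same name. Only the file's own schema is consulted.
std::optional<std::size_t> resolveByName(
    const Schema& fileSchema, const std::string& column) {
  return indexOf(fileSchema, column);
}

// Index-based resolution: a requested column maps to the file column at the
// ordinal the column has in the schema passed in. File column names are
// never consulted, so the file may use different names or none at all.
std::optional<std::size_t> resolveByIndex(
    const Schema& schema, const std::string& column, std::size_t fileColumnCount) {
  const auto ordinal = indexOf(schema, column);
  if (ordinal && *ordinal < fileColumnCount) {
    return ordinal;
  }
  return std::nullopt;
}
```

In this sketch, a table `{id, name, ts}` read from a file whose columns are named differently cannot resolve `ts` by name, but resolves it to ordinal 2 by index. If only the schema of the columns being read (`{ts}`) were passed in, the same lookup would yield ordinal 0 and point at the wrong file column.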

To do this I updated VeloxIteratorApi to set the file schema of LocalFilesNodes to the data schema of the Scan (when
present). This is similar to what's already done in the Iterator API for ClickHouse. I then parse that schema and pass it to the
SplitInfo in VeloxPlanConverter. Finally, I extract it from the SplitInfo in SubstraitToVeloxPlanConverter and pass it to the
HiveTableHandle constructor in place of the base schema.
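
The last step of that flow can be sketched roughly as follows. Every type here is a hypothetical stand-in rather than the actual Gluten or Velox class, and the fallback to the base schema when no table schema is present is an assumption of the sketch, based only on the "when present" wording above.

```cpp
#include <optional>
#include <string>
#include <vector>

// All types below are hypothetical stand-ins, not the Gluten or Velox classes.
using Schema = std::vector<std::string>;

// Stand-in for the per-split information the plan converter receives.
struct SplitInfoSketch {
  // Parsed from the file schema set on the local files node; empty when the
  // scan carried no data schema.
  std::optional<Schema> tableSchema;
};

// Stand-in for the table handle; it just records the schema it was given.
struct TableHandleSketch {
  Schema schema;
};

// Prefer the table schema from the split info and fall back to the base
// schema (the columns actually read) when none was provided. The fallback
// is an assumption of this sketch.
TableHandleSketch makeTableHandle(
    const SplitInfoSketch& splitInfo, const Schema& baseSchema) {
  return TableHandleSketch{splitInfo.tableSchema.value_or(baseSchema)};
}
```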

This should not produce any noticeable effect for the existing code paths/file formats as the table schema is a superset of
the base schema and file columns are currently mapped to table columns exclusively by name.

How was this patch tested?

Ran the existing unit tests. This change should not alter existing behavior, but it should enable future changes.

github-actions · 2025-08-27T23:11:07Z

#10568

github-actions · 2025-08-27T23:11:21Z

Run Gluten Clickhouse CI on x86

Yohahaha · 2025-08-28T08:50:29Z

backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxIteratorApi.scala

    splitInfos.zipWithIndex.map {
-      case (splitInfos, index) =>
+      case (splits, index) =>
+        val splitsByteArray = splits.zipWithIndex.map {


Could we inject the table schema into the Substrait plan?
The draft implementation will cause more GC on the driver.

kevinwilfong · Aug 28, 2025

When you say inject it into the Substrait plan, do you mean in the ReadRel?

If so, definitely. This was something I was already considering; I just wasn't sure how open folks are to updating the Substrait protobufs, so I borrowed this approach from what's already done in the code for ClickHouse.

We could probably deprecate the schema field in FileOrFiles. It looks like the only place it's used is that ClickHouse code path, where it seems to be set only when the file is in the TextFile format. We could change that logic to consume the field only for the TextFile format, if that's not already how it works.

Yohahaha · Aug 29, 2025

> When you say inject it into the substrait plan do you mean in the ReadRel?

Yes, you can try modifying gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto.

> We could probably deprecate the schema field in FileOrFiles

+1

kevinwilfong · Sep 2, 2025

@Yohahaha I see now why they originally added it at the split level for ClickHouse.

They only wanted to add the table schema if it was necessary, which today is only when a file is in the TextFileFormat. Since this is a property of the partition/split, we don't know it at the time we're constructing the plan.

If we always set the table schema, the table schema may contain a column type that Gluten doesn't support (e.g. TimestampNTZ), which causes an exception when serializing the ReadRelNode.

We can optionally add it to the Substrait plan for the case where we want to support index-based column resolution, which can be determined at plan generation time. However, to support file formats that always depend on knowing the schema, like Text files, I think we'll need to keep it at the split level.

See the failure in the test "SPARK-36726: test incorrect Parquet row group file offset" in GlutenParquetIOSuite in gluten-ut/spark35 when I add it at the plan level.
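
The plan-level versus split-level trade-off discussed in this thread could be summarized by a sketch like the one below. The enum and both functions are hypothetical and only encode the reasoning above; they are not how Gluten decides anything today.

```cpp
// Hypothetical file formats for this sketch.
enum class FileFormat { Parquet, Orc, Text };

// Plan level: this is decided when the plan is generated, so it can only
// depend on plan-wide settings, such as whether index-based column
// resolution was requested.
bool planCarriesTableSchema(bool indexBasedResolution) {
  return indexBasedResolution;
}

// Split level: the file format is a property of the split, so a format that
// carries no schema of its own (Text in this sketch) gets the table schema
// attached per split.
bool splitCarriesTableSchema(FileFormat format) {
  return format == FileFormat::Text;
}

// The reader sees a table schema if either level supplied one.
bool readerHasTableSchema(bool indexBasedResolution, FileFormat format) {
  return planCarriesTableSchema(indexBasedResolution) ||
      splitCarriesTableSchema(format);
}
```

Under these assumptions, a Parquet split read by name never carries the table schema, so an unsupported column type elsewhere in the table would not reach serialization, while a Text split always carries it.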

github-actions · 2025-08-29T18:53:08Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-08-29T18:56:09Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-08-29T18:58:19Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-08-29T20:34:32Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-08-29T21:08:22Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-08-29T23:11:09Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-09-02T16:43:46Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-10-18T01:59:22Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions · 2025-10-28T02:05:00Z

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

kevinwilfong marked this pull request as draft August 27, 2025 23:10

github-actions bot added the CORE and VELOX labels Aug 27, 2025

kevinwilfong force-pushed the data_schema branch from aca8302 to c1a3f30 Compare August 27, 2025 23:11

github-actions bot removed the CORE label Aug 27, 2025

kevinwilfong force-pushed the data_schema branch from c1a3f30 to 8a33c16 Compare August 28, 2025 00:18

Yohahaha reviewed Aug 28, 2025

kevinwilfong force-pushed the data_schema branch from 8a33c16 to e92e9ad Compare August 29, 2025 18:52

github-actions bot added the CORE and CLICKHOUSE labels Aug 29, 2025

kevinwilfong force-pushed the data_schema branch from e92e9ad to 1856d11 Compare August 29, 2025 18:55

kevinwilfong force-pushed the data_schema branch from 1856d11 to 2bb6797 Compare August 29, 2025 18:57

kevinwilfong force-pushed the data_schema branch from 2bb6797 to 501b454 Compare August 29, 2025 20:34

kevinwilfong force-pushed the data_schema branch from 501b454 to 5c0ae08 Compare August 29, 2025 21:07

github-actions bot added the DATA_LAKE label Aug 29, 2025

kevinwilfong marked this pull request as ready for review August 29, 2025 23:02

kevinwilfong marked this pull request as draft August 29, 2025 23:04

kevinwilfong force-pushed the data_schema branch from 5c0ae08 to d7ba2e9 Compare August 29, 2025 23:10

use table schema

7663b2f

kevinwilfong force-pushed the data_schema branch from d7ba2e9 to 7663b2f Compare September 2, 2025 16:43

kevinwilfong mentioned this pull request Sep 12, 2025

[VL] Support mapping columns by position index for ORC and Parquet files #10697

Merged

github-actions bot added the stale label Oct 18, 2025

github-actions bot closed this Oct 28, 2025
